Adaptive Boosting


Roadmap 


1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models


Lecture 7: Blending and Bagging 


blending known diverse hypotheses uniformly, 
linearly, or even non-linearly; obtaining diverse 
hypotheses from bootstrapped data 












Lecture 8: Adaptive Boosting 


• Motivation of Boosting
• Diversity by Re-weighting
• Adaptive Boosting Algorithm
• Adaptive Boosting in Action





3 Distilling Implicit Features: Extraction Models
Adaptive Boosting Motivation of Boosting


Apple Recognition Problem


• is this a picture of an apple?
• say, want to teach a class of 6-year-olds
• gather photos under CC-BY-2.0 license on Flickr





(thanks to the authors below!) 


(APAL stands for Apple and Pear Australia Ltd) 


Dan Foy: https://flic.kr/p/jNQ55
APAL: https://flic.kr/p/jzP1VB
nachans: https://flic.kr/p/9XD7Ag
APAL: https://flic.kr/p/jzRe4u


Hsuan-Tien Lin (NTU CSIE) 


adrianbartel: https://flic.kr/p/bdy2hZ
Jo Jakeman: https://flic.kr/p/7jwtGp
ANdrzej cH.: https://flic.kr/p/51DKA8
APAL: https://flic.kr/p/jzPYNr
Stuart Webster: https://flic.kr/p/9C3Ybd
APAL: https://flic.kr/p/jzScif
Photo credits (continued):

Mr. Roboto.: https://flic.kr/p/i5BN85
Richard North: https://flic.kr/p/bHhPkB
Richard North: https://flic.kr/p/d8tGou
Emilian Robert Vicol: https://flic.kr/p/bpmGXW
Nathaniel McQueen: https://flic.kr/p/pZvlMf
Crystal: https://flic.kr/p/kaPYp
jfh686: https://flic.kr/p/6vjRFH
skyseeker: https://flic.kr/p/2MynV
Janet Hudson: https://flic.kr/p/7QDBbm
Rennett Stowe: https://flic.kr/p/agmnrk
Our Fruit Class Begins

• Teacher: Please look at the pictures of apples and non-apples below. Based on those pictures, how would you describe an apple? Michael?

• Michael: I think apples are circular.





(Class): Apples are circular. 





Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/25 




Our Fruit Class Continues 


• Teacher: Being circular is a good feature for the apples. However, if you only say circular, you could make several mistakes. What else can we say for an apple? Tina?

• Tina: It looks like apples are red.









(Class): Apples are somewhat circular and 
somewhat red. 
Our Fruit Class Continues More


• Teacher: Yes. Many apples are red. However, you could still make mistakes based on circular and red. Do you have any other suggestions, Joey?

• Joey: Apples could also be green.


(Class): Apples are somewhat circular and somewhat red and possibly green.
Our Fruit Class Ends


• Teacher: Yes. It seems that apples might be circular, red, green. But you may confuse them with tomatoes or peaches, right? Any more suggestions, Jessica?

• Jessica: Apples have stems at the top.


(Class): Apples are somewhat circular, somewhat red, possibly green, and may have stems at the top.
Motivation


• students: simple hypotheses $g_t$ (like vertical/horizontal lines)
• (Class): sophisticated hypothesis $G$ (like black curve)
• Teacher: a tactical learning algorithm that directs the students to focus on key examples


next: the 'math' of such an algorithm
Fun Time


Which of the following can help recognize an apple?
(1) apples are often circular
(2) apples are often red or green
(3) apples often have stems at the top
(4) all of the above


Reference Answer: (4)

Congratulations! You have passed first grade. :-)
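Adaptive Boosting Diversity by Re-weighting


Bootstrapping as Re-weighting Process


$\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_3, y_3), (\mathbf{x}_4, y_4)\}$
$\overset{\text{bootstrap}}{\Longrightarrow}$
$\tilde{\mathcal{D}}_t = \{(\mathbf{x}_1, y_1), (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_4, y_4)\}$


weighted $E_{\text{in}}$ on $\mathcal{D}$:
$E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{4} \sum_{n=1}^{4} u_n^{(t)} \cdot [\![ y_n \neq h(\mathbf{x}_n) ]\!]$

$E_{\text{in}}$ on $\tilde{\mathcal{D}}_t$:
$E_{\text{in}}^{0/1}(h) = \frac{1}{4} \sum_{(\mathbf{x}, y) \in \tilde{\mathcal{D}}_t} [\![ y \neq h(\mathbf{x}) ]\!]$

example in $\mathcal{D}$ | weight | copies in $\tilde{\mathcal{D}}_t$
$(\mathbf{x}_1, y_1)$ | $u_1 = 2$ | $(\mathbf{x}_1, y_1), (\mathbf{x}_1, y_1)$
$(\mathbf{x}_2, y_2)$ | $u_2 = 1$ | $(\mathbf{x}_2, y_2)$
$(\mathbf{x}_3, y_3)$ | $u_3 = 0$ | (none)
$(\mathbf{x}_4, y_4)$ | $u_4 = 1$ | $(\mathbf{x}_4, y_4)$


each diverse $g_t$ in bagging:
by minimizing bootstrap-weighted error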
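The equivalence between resampling and re-weighting can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names `bootstrap_weights` and `weighted_error`, the toy inputs, and the hypothesis `h` are all illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_weights(N, rng):
    """Draw N indices with replacement; return how often each index was drawn."""
    counts = Counter(rng.randrange(N) for _ in range(N))
    return [counts[n] for n in range(N)]

def weighted_error(h, X, y, u):
    """(1/N) * sum of u_n over the examples that h misclassifies."""
    return sum(un for xn, yn, un in zip(X, y, u) if h(xn) != yn) / len(X)

# Toy data (assumed for illustration); u = [2, 1, 0, 1] mirrors the slide:
# example 1 drawn twice, example 3 never drawn.
X = [0.0, 1.0, 2.0, 3.0]
y = [-1, -1, 1, 1]
u = [2, 1, 0, 1]
h = lambda x: 1 if x > 0.5 else -1   # hypothetical hypothesis, wrong only on x = 1.0

# Plain 0/1 error on the resampled set versus u-weighted error on the original set.
resampled = [(X[0], y[0]), (X[0], y[0]), (X[1], y[1]), (X[3], y[3])]
err_resampled = sum(h(xn) != yn for xn, yn in resampled) / len(resampled)
err_weighted = weighted_error(h, X, y, u)
print(err_resampled, err_weighted)
```

Both numbers agree because each count $u_n$ simply records how many copies of example $n$ landed in the bootstrap sample.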
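Weighted Base Algorithm


minimize (regularized)
$E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{N} \sum_{n=1}^{N} u_n \cdot \text{err}(y_n, h(\mathbf{x}_n))$


SVM: $E_{\text{in}}^{\mathbf{u}} \propto C \sum_{n=1}^{N} u_n \, \widehat{\text{err}}_{\text{SVM}}$ by dual QP
• equivalent to the upper bound $0 \le \alpha_n \le C u_n$

logistic regression: $E_{\text{in}}^{\mathbf{u}} \propto \sum_{n=1}^{N} u_n \, \text{err}_{\text{CE}}$ by SGD
• equivalent to sampling $(\mathbf{x}_n, y_n)$ with probability proportional to $u_n$


example-weighted learning:
extension of class-weighted learning in Lecture 8 of ML Foundations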
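The sampling trick for SGD can be sketched with the standard library alone. This is an illustrative assumption about how one might draw weighted examples, not the lecture's own code; `sample_by_weight` and the toy `data` are hypothetical names.

```python
import random

def sample_by_weight(data, u, k, rng):
    """Draw k examples with replacement, each with probability proportional to u_n."""
    return rng.choices(data, weights=u, k=k)

# Toy labelled examples (assumed); weights follow the bootstrap example u = [2, 1, 0, 1].
data = [("x1", +1), ("x2", -1), ("x3", +1), ("x4", -1)]
u = [2, 1, 0, 1]

draws = sample_by_weight(data, u, k=1000, rng=random.Random(0))
for example in data:
    print(example, draws.count(example))
```

An example with zero weight is never drawn, and an example with twice the weight is drawn about twice as often, so each SGD step sees the weighted error in expectation.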
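Re-weighting for More Diverse Hypothesis

'improving' bagging for binary classification:
how to re-weight for more diverse hypotheses?


$g_t \leftarrow \underset{h \in \mathcal{H}}{\text{argmin}} \left( \sum_{n=1}^{N} u_n^{(t)} [\![ y_n \neq h(\mathbf{x}_n) ]\!] \right)$

$g_{t+1} \leftarrow \underset{h \in \mathcal{H}}{\text{argmin}} \left( \sum_{n=1}^{N} u_n^{(t+1)} [\![ y_n \neq h(\mathbf{x}_n) ]\!] \right)$


if $g_t$ 'not good' for $\mathbf{u}^{(t+1)}$ $\Longrightarrow$ $g_t$-like hypotheses not returned as $g_{t+1}$
$\Longrightarrow$ $g_{t+1}$ diverse from $g_t$


idea: construct $\mathbf{u}^{(t+1)}$ to make $g_t$ random-like

$\frac{\sum_{n=1}^{N} u_n^{(t+1)} [\![ y_n \neq g_t(\mathbf{x}_n) ]\!]}{\sum_{n=1}^{N} u_n^{(t+1)}} = \frac{1}{2}$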
'Optimal' Re-weighting


want: $\frac{\sum_{n=1}^{N} u_n^{(t+1)} [\![ y_n \neq g_t(\mathbf{x}_n) ]\!]}{\sum_{n=1}^{N} u_n^{(t+1)}} = \frac{\blacksquare_{t+1}}{\blacksquare_{t+1} + \bullet_{t+1}} = \frac{1}{2}$, where

$\blacksquare_{t+1} = \sum_{n=1}^{N} u_n^{(t+1)} [\![ y_n \neq g_t(\mathbf{x}_n) ]\!]$, $\bullet_{t+1} = \sum_{n=1}^{N} u_n^{(t+1)} [\![ y_n = g_t(\mathbf{x}_n) ]\!]$


• need: (total $u_n^{(t+1)}$ of incorrect) = (total $u_n^{(t+1)}$ of correct)
• one possibility by re-scaling (multiplying) weights, if

(total $u_n^{(t)}$ of incorrect) = 1126 | (total $u_n^{(t)}$ of correct) = 6211
(weighted incorrect rate) = $\frac{1126}{7337}$ | (weighted correct rate) = $\frac{6211}{7337}$
incorrect: $u_n^{(t+1)} \leftarrow u_n^{(t)} \cdot 6211$ | correct: $u_n^{(t+1)} \leftarrow u_n^{(t)} \cdot 1126$


'optimal' re-weighting under weighted incorrect rate $\epsilon_t$:
multiply incorrect $\propto (1 - \epsilon_t)$; multiply correct $\propto \epsilon_t$
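A short worked check of why this choice balances the two totals. The symbol $U$ for the total weight at round $t$ is introduced here only for illustration and does not appear on the slides.

```latex
\begin{align*}
\text{new incorrect total} &= (1 - \epsilon_t) \cdot \underbrace{\epsilon_t \, U}_{\text{old incorrect total}}, \\
\text{new correct total}   &= \epsilon_t \cdot \underbrace{(1 - \epsilon_t) \, U}_{\text{old correct total}}, \\
\text{so both totals equal } & \epsilon_t (1 - \epsilon_t) \, U
\text{ and the new weighted error rate of } g_t \text{ is } \tfrac{1}{2}. \\
\text{With the numbers above: } & 1126 \cdot 6211 = 6211 \cdot 1126 = 6993586 .
\end{align*}
```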




Fun Time


For four examples with $u_n^{(1)} = \frac{1}{4}$ for all examples. If $g_1$ predicts the first example wrongly but all the other three examples correctly. After the 'optimal' re-weighting, what is $u_1^{(2)} / u_2^{(2)}$?

(1) 4
(2) 3
(3) 1/3
(4) 1/4


Reference Answer: (2)

By 'optimal' re-weighting, $u_1$ is scaled proportional to $\frac{3}{4}$ and every other $u_n$ is scaled proportional to $\frac{1}{4}$. So example 1 is now three times more important than any other example.
Adaptive Boosting Adaptive Boosting Algorithm


Scaling Factor


'optimal' re-weighting: let $\epsilon_t = \frac{\sum_{n=1}^{N} u_n^{(t)} [\![ y_n \neq g_t(\mathbf{x}_n) ]\!]}{\sum_{n=1}^{N} u_n^{(t)}}$,

multiply incorrect $\propto (1 - \epsilon_t)$; multiply correct $\propto \epsilon_t$


define scaling factor $\diamond_t = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$

incorrect $\leftarrow$ incorrect $\cdot \diamond_t$
correct $\leftarrow$ correct $/ \diamond_t$


• equivalent to optimal re-weighting
• $\diamond_t \ge 1$ iff $\epsilon_t \le \frac{1}{2}$
  - physical meaning: scale up incorrect; scale down correct
  - like what Teacher does


scaling-up incorrect examples
leads to diverse hypotheses
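The effect of the scaling factor can be checked on the four-example case from the preceding Fun Time. This is a minimal sketch under that assumption; the function name `rescale` and the boolean list `wrong` are illustrative, not from the slides.

```python
import math

def rescale(u, wrong):
    """Multiply weights of misclassified examples by the scaling factor, divide the rest.

    Requires 0 < eps < 1, where eps is the weighted error rate under u.
    """
    eps = sum(un for un, w in zip(u, wrong) if w) / sum(u)
    scale = math.sqrt((1 - eps) / eps)
    new_u = [un * scale if w else un / scale for un, w in zip(u, wrong)]
    return new_u, eps, scale

# Four examples with equal weights; only the first one is misclassified.
u = [0.25, 0.25, 0.25, 0.25]
wrong = [True, False, False, False]

new_u, eps, scale = rescale(u, wrong)
new_eps = new_u[0] / sum(new_u)      # weighted error of the same hypothesis afterwards
print(eps, scale, new_u[0] / new_u[1], new_eps)
```

Multiplying by the factor and dividing by it keeps the same ratio between incorrect and correct weights as multiplying by $(1 - \epsilon_t)$ and $\epsilon_t$, so the re-weighted error of the old hypothesis lands at one half.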


A Preliminary Algorithm

$\mathbf{u}^{(1)} = ?$
for $t = 1, 2, \ldots, T$
1. obtain $g_t$ by $\mathcal{A}(\mathcal{D}, \mathbf{u}^{(t)})$,
   where $\mathcal{A}$ tries to minimize $\mathbf{u}^{(t)}$-weighted 0/1 error
2. update $\mathbf{u}^{(t)}$ to $\mathbf{u}^{(t+1)}$ by $\diamond_t = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$,
   where $\epsilon_t$ = weighted error (incorrect) rate of $g_t$
return $G(\mathbf{x}) = ?$


• want $g_1$ 'best' for $E_{\text{in}}$: $u_n^{(1)} = \frac{1}{N}$
• $G(\mathbf{x})$:
  • uniform? but $g_2$ very bad for $E_{\text{in}}$ (why? :-))
  • linear, non-linear? as you wish


next: a special algorithm to aggregate
linearly on the fly with theoretical guarantee


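Linear Aggregation on the Fly


$\mathbf{u}^{(1)} = [\frac{1}{N}, \frac{1}{N}, \ldots, \frac{1}{N}]$
for $t = 1, 2, \ldots, T$
1. obtain $g_t$ by $\mathcal{A}(\mathcal{D}, \mathbf{u}^{(t)})$, where ...
2. update $\mathbf{u}^{(t)}$ to $\mathbf{u}^{(t+1)}$ by $\diamond_t = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$, where ...
3. compute $\alpha_t = \ln(\diamond_t)$
return $G(\mathbf{x}) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}) \right)$


• wish: large $\alpha_t$ for 'good' $g_t$ $\Longleftarrow$ $\alpha_t = \text{monotonic}(\diamond_t)$
• will take $\alpha_t = \ln(\diamond_t)$
  • $\epsilon_t = \frac{1}{2} \Longrightarrow \diamond_t = 1 \Longrightarrow \alpha_t = 0$ (bad $g_t$ zero weight)
  • $\epsilon_t = 0 \Longrightarrow \diamond_t = \infty \Longrightarrow \alpha_t = \infty$ (super $g_t$ superior weight)


Adaptive Boosting = weak base learning algorithm $\mathcal{A}$ (Student)
+ optimal re-weighting factor $\diamond_t$ (Teacher)
+ 'magic' linear aggregation $\alpha_t$ (Class)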
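How the vote weight reacts to the weighted error rate can be tabulated directly. A minimal sketch; the function name `alpha` and the explicit handling of a zero error rate are choices made here for illustration.

```python
import math

def alpha(eps):
    """Vote weight ln(sqrt((1 - eps) / eps)) for a weighted error rate 0 <= eps < 1."""
    if eps == 0:
        return math.inf          # a perfect hypothesis gets unbounded weight
    return math.log(math.sqrt((1 - eps) / eps))

for eps in (0.0, 0.1, 0.25, 0.5, 0.75):
    print(f"eps = {eps:<4}  alpha = {alpha(eps)}")
```

The weight is positive below an error rate of one half, exactly zero at one half, and negative above it, so a worse-than-random hypothesis is effectively used with its sign flipped.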
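Adaptive Boosting (AdaBoost) Algorithm


$\mathbf{u}^{(1)} = [\frac{1}{N}, \frac{1}{N}, \ldots, \frac{1}{N}]$
for $t = 1, 2, \ldots, T$
1. obtain $g_t$ by $\mathcal{A}(\mathcal{D}, \mathbf{u}^{(t)})$,
   where $\mathcal{A}$ tries to minimize $\mathbf{u}^{(t)}$-weighted 0/1 error
2. update $\mathbf{u}^{(t)}$ to $\mathbf{u}^{(t+1)}$ by

   $[\![ y_n \neq g_t(\mathbf{x}_n) ]\!]$ (incorrect examples): $u_n^{(t+1)} \leftarrow u_n^{(t)} \cdot \diamond_t$
   $[\![ y_n = g_t(\mathbf{x}_n) ]\!]$ (correct examples): $u_n^{(t+1)} \leftarrow u_n^{(t)} / \diamond_t$

   where $\diamond_t = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$ and $\epsilon_t = \frac{\sum_{n=1}^{N} u_n^{(t)} [\![ y_n \neq g_t(\mathbf{x}_n) ]\!]}{\sum_{n=1}^{N} u_n^{(t)}}$

3. compute $\alpha_t = \ln(\diamond_t)$
return $G(\mathbf{x}) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}) \right)$


AdaBoost: provable boosting property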
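The three steps above can be sketched in plain Python. This is an illustrative sketch rather than a reference implementation: `stump_1d` is a hypothetical brute-force base learner for one-dimensional inputs, the ten-point data set is a toy example chosen here, and returning a perfect $g_t$ on its own when $\epsilon_t = 0$ is an assumption made to avoid an infinite $\alpha_t$.

```python
import math

def stump_1d(X, y, u):
    """Hypothetical base learner: best u-weighted rule s * sign(x - theta) on 1-D inputs."""
    xs = sorted(set(X))
    thetas = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for theta in thetas:
        for s in (1, -1):
            err = sum(un for xn, yn, un in zip(X, y, u)
                      if s * (1 if xn > theta else -1) != yn)
            if best is None or err < best[0]:
                best = (err, theta, s)
    _, theta, s = best
    return lambda x: s * (1 if x > theta else -1)

def adaboost(X, y, base_learner, T):
    """Run T rounds of re-weighting and return the aggregated classifier G."""
    N = len(X)
    u = [1.0 / N] * N                         # uniform initial weights
    ensemble = []                             # pairs (alpha_t, g_t)
    for _ in range(T):
        g = base_learner(X, y, u)             # step 1: weighted base learning
        wrong = [g(xn) != yn for xn, yn in zip(X, y)]
        eps = sum(un for un, w in zip(u, wrong) if w) / sum(u)
        if eps == 0:
            return g                          # assumption: a perfect g_t is returned alone
        scale = math.sqrt((1 - eps) / eps)    # step 2: scaling factor
        u = [un * scale if w else un / scale for un, w in zip(u, wrong)]
        ensemble.append((math.log(scale), g)) # step 3: alpha_t = ln(scale)
    return lambda x: 1 if sum(a * g(x) for a, g in ensemble) > 0 else -1

# Toy 1-D data that no single threshold rule separates (assumed for illustration).
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
G1 = adaboost(X, y, stump_1d, T=1)
G3 = adaboost(X, y, stump_1d, T=3)
errors_1 = sum(G1(xn) != yn for xn, yn in zip(X, y))
errors_3 = sum(G3(xn) != yn for xn, yn in zip(X, y))
print(errors_1, errors_3)
```

On this toy set a single threshold rule leaves three training mistakes, while the weighted vote of three rules fits all ten points, which is the 'weak students, strong class' effect in miniature.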
Theoretical Guarantee of AdaBoost

• From VC bound

$E_{\text{out}}(G) \le E_{\text{in}}(G) + O\left( \sqrt{ O(d_{\text{VC}}(\mathcal{H}) \cdot T \log T) \cdot \frac{\log N}{N} } \right)$

where $O(d_{\text{VC}}(\mathcal{H}) \cdot T \log T)$ is the $d_{\text{VC}}$ of all possible $G$

• first term can be small:
  $E_{\text{in}}(G) = 0$ after $T = O(\log N)$ iterations if $\epsilon_t \le \epsilon < \frac{1}{2}$ always
• second term can be small:
  overall $d_{\text{VC}}$ grows "slowly" with $T$


boosting view of AdaBoost:

if $\mathcal{A}$ is weak but always slightly better than random ($\epsilon_t \le \epsilon < \frac{1}{2}$),
then (AdaBoost + $\mathcal{A}$) can be strong ($E_{\text{in}} = 0$ and $E_{\text{out}}$ small)





Fun Time

According to $\alpha_t = \ln(\diamond_t)$, and $\diamond_t = \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$, when would $\alpha_t > 0$?

(1) $\epsilon_t < \frac{1}{2}$
(2) $\epsilon_t > \frac{1}{2}$
(3) $\epsilon_t \neq 1$
(4) $\epsilon_t \neq 0$


Reference Answer: (1)

The math part should be easy for you, and it is interesting to think about the physical meaning: $\alpha_t > 0$ ($g_t$ is useful for $G$) if and only if the weighted error rate of $g_t$ is better than random!
Adaptive Boosting Adaptive Boosting in Action


Decision Stump


want: a 'weak' base learning algorithm $\mathcal{A}$
that minimizes $E_{\text{in}}^{\mathbf{u}}(h) = \frac{1}{N} \sum_{n=1}^{N} u_n \cdot [\![ y_n \neq h(\mathbf{x}_n) ]\!]$ a little bit


a popular choice: decision stump
• in ML Foundations Homework 2, remember? :-)

$h_{s,i,\theta}(\mathbf{x}) = s \cdot \text{sign}(x_i - \theta)$

• positive and negative rays on some feature: three parameters (feature $i$, threshold $\theta$, direction $s$)
• physical meaning: vertical/horizontal lines in 2D
• efficient to optimize: $O(d \cdot N \log N)$ time


decision stump model:
allows efficient minimization of $E_{\text{in}}^{\mathbf{u}}$
but perhaps too weak to work by itself
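One way to reach the stated running time is to sort each feature once and sweep the threshold upward while updating the weighted error incrementally. The sketch below follows that idea; the function names, the tie-breaking order, and treating a value equal to the threshold as the negative side are assumptions made here, not details from the slides.

```python
def fit_stump(X, y, u):
    """Return (weighted error, feature i, threshold theta, direction s).

    Minimizes the sum of u_n over mistakes of h(x) = s * sign(x[i] - theta).
    One sort per feature gives O(d * N log N) overall.
    """
    N, d = len(X), len(X[0])
    total = sum(u)
    best = None
    for i in range(d):
        order = sorted(range(N), key=lambda n: X[n][i])
        # theta = -inf with s = +1 predicts +1 everywhere: mistakes are the negatives.
        err_pos = sum(u[n] for n in range(N) if y[n] == -1)
        candidates = [(float("-inf"), err_pos)]
        for k, n in enumerate(order):
            # Raising theta above X[n][i] flips the s = +1 prediction on example n to -1.
            err_pos += u[n] if y[n] == 1 else -u[n]
            if k < N - 1 and X[order[k + 1]][i] != X[n][i]:
                theta = (X[n][i] + X[order[k + 1]][i]) / 2
                candidates.append((theta, err_pos))
        for theta, e in candidates:
            # The s = -1 stump errs exactly where the s = +1 stump is right.
            for s, err in ((1, e), (-1, total - e)):
                if best is None or err < best[0]:
                    best = (err, i, theta, s)
    return best

def stump_predict(x, i, theta, s):
    """Evaluate s * sign(x[i] - theta), sending x[i] == theta to the negative side."""
    return s * (1 if x[i] > theta else -1)

# Toy 2-D data (assumed): feature 0 separates the classes, feature 1 does not.
X = [[0, 5], [1, 3], [2, 8], [3, 1]]
y = [-1, -1, 1, 1]
u = [0.25, 0.25, 0.25, 0.25]
err, i, theta, s = fit_stump(X, y, u)
print(err, i, theta, s)
```

The returned triple can be wrapped in a closure and passed to a boosting loop as the base learner; only the weights change between rounds, so the sorted orders could also be cached across rounds.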
A Simple Data Set

[Figures: AdaBoost-Stump on a simple data set, shown initially and after successive iterations]

'Teacher'-like algorithm works!
A Complicated Data Set

[Figure: AdaBoost-Stump on a complicated data set at t = 100]

AdaBoost-Stump: non-linear yet efficient
AdaBoost-Stump in Application

[Figure: face detection example]
original picture by F.U.S.I.A. assistant and derivative work by Sylenius via Wikimedia Commons


The World's First 'Real-Time' Face Detection Program

• AdaBoost-Stump as core model: linear aggregation of key patches selected out of 162,336 possibilities in 24x24 images
  - feature selection achieved through AdaBoost-Stump
• modified linear aggregation $G$ to rule out non-face earlier
  - efficiency achieved through modified linear aggregation


AdaBoost-Stump:
efficient feature selection and aggregation
Fun Time


For a data set of size 9876 that contains $\mathbf{x}_n \in \mathbb{R}^{5566}$, after running AdaBoost-Stump for 1126 iterations, what is the number of distinct features within $\mathbf{x}$ that are effectively used by $G$?

(1) $0 \le \text{number} \le 1126$
(2) $1126 < \text{number} \le 5566$
(3) $5566 < \text{number} \le 9876$
(4) $9876 < \text{number}$


Reference Answer: (1)

Each decision stump takes only one feature. So 1126 decision stumps need at most 1126 distinct features.
Summary


1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models


Lecture 8: Adaptive Boosting

• Motivation of Boosting
  aggregate weak hypotheses for strength
• Diversity by Re-weighting
  scale up incorrect, scale down correct
• Adaptive Boosting Algorithm
  two heads are better than one, theoretically
• Adaptive Boosting in Action
  AdaBoost-Stump useful and efficient

• next: learning conditional aggregation instead of linear one


3 Distilling Implicit Features: Extraction Models





